“Calhoun 


Institutional Archive of the Naval Postgraduate School 





Calhoun: The NPS Institutional Archive 
DSpace Repository 


Theses and Dissertations 1. Thesis and Dissertation Collection, all items 


2015-09 


Density deconvolution with epi-splines 


Carbaugh, James C. 


Monterey, California: Naval Postgraduate School 
http://hdl.handle.net/10945/47236 


This publication is a work of the U.S. Government as defined in Title 17, United 
States Code, Section 101. Copyright protection is not available for this work in the 
United States. 


Downloaded from NPS Archive: Calhoun 


Calhoun is the Naval Postgraduate School's public access digital repository for 


\§ D U DL EY research materials and institutional publications created by the NPS community. 
«iis Calhoun is named for Professor of Mathematics Guy K. Calhoun, NPS's first 


NY KNOX appointed -- and published -- scholarly author. 


LIBRARY Dudley Knox Library / Naval Postgraduate School 
411 Dyer Road / 1 University Circle 


http://www.nps.edu/library Monterey, California USA 93943 





NAVAL 
POSTGRADUATE 
SCHOOL 


MONTEREY, CALIFORNIA 


THESIS 


DENSITY DECONVOLUTION WITH EPI-SPLINES 
by 
James C. Carbaugh 


September 2015 


Thesis Co-Advisors: Johannes O. Royset 
Wei Kang 
Second Reader: Samuel E. Buttrey 





Approved for public release; distribution is unlimited 


THIS PAGE INTENTIONALLY LEFT BLANK 


REPORT DOCUMENTATION PAGE Form Approved OMB No. 0704-0188 


Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, 
searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments 
regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden to Washington 
headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and 
to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) Washington DC 20503. 


1. AGENCY USE ONLY (Leave Blank) 2. REPORT DATE 3. REPORT TYPE AND DATES COVERED 
09-2015 Master’s Thesis 07-08-2013 to 09-25-2015 


4. TITLE AND SUBTITLE 5. FUNDING NUMBERS 
DENSITY DECONVOLUTION WITH EPI-SPLINES 


6. AUTHOR(S) 
Carbaugh, James C. 


. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION REPORT 
NUMBER 
Naval Postgraduate School 
Monterey, CA 93943-5000 


. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSORING / MONITORING 


AGENCY REPORT NUMBER 
N/A 


11. SUPPLEMENTARY NOTES 


The views expressed in this document are those of the author and do not reflect the official policy or position of the Department of 


Defense or the U.S. Government. IRB Protocol Number: N/A. 


12a. DISTRIBUTION / AVAILABILITY STATEMENT 12b. DISTRIBUTION CODE 


Approved for public release; distribution is unlimited 


13. ABSTRACT (maximum 200 words) 


Analysts tasked with developing probability density estimates may obtain data in sets of varying quality and quantity. Often low- 
fidelity data contaminated with measurement error, or “noise,” may be abundant, but the cost of obtaining data free of these errors 
will limit the amount of high-fidelity data available. In such a scenario, the problem is to identify an estimate of a probability density 
function given scarce high-fidelity observations, knowledge of measurement errors, abundant “noisy” data, and available user knowl- 
edge of the density apart from empirical data. We solve this rich class of deconvolution problems through constrained optimization 
with first-order epi-splines, which are used for the first time to approximate densities to an arbitrarily high level of precision. We limit 
our scope to univariate densities where measurement errors are homoscedastic. Demonstrations come from a benchmark from the 
literature, historical medical data, and a scenario in uncertainty quantification in fluid dynamics. Results show that deconvolution via 
epi-splines is viable, comparable with a widely available deconvolution method, and shows potential for savings in resource budgets 
for data collection. 


14. SUBJECT TERMS 15. NUMBER OF 
Deconvolution, Probability Density Estimation, Epi-splines, Optimization, Functional Approximation, Non- PAGES 73 


Parametric Statistics, Uncertainty Quantification 16. PRICE CODE 


17. SECURITY 18. SECURITY 19. SECURITY 20. LIMITATION OF 
CLASSIFICATION OF CLASSIFICATION OF THIS CLASSIFICATION OF ABSTRACT 
REPORT PAGE ABSTRACT 


Unclassified Unclassified Unclassified UU 


NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89) 
Prescribed by ANSI Std. 239-18 





THIS PAGE INTENTIONALLY LEFT BLANK 


il 


Approved for public release; distribution is unlimited 
DENSITY DECONVOLUTION WITH EPI-SPLINES 


James C. Carbaugh 
Lieutenant, United States Navy 
B.S., U.S. Naval Academy, 2007 


Submitted in partial fulfillment of the 
requirements for the degrees of 


MASTER OF SCIENCE IN OPERATIONS RESEARCH 
and 
MASTER OF SCIENCE IN APPLIED MATHEMATICS 


from the 


NAVAL POSTGRADUATE SCHOOL 


September 2015 
Author: James C. Carbaugh 
Approved by: Johannes O. Royset, Ph.D. 


Thesis Co-Advisor 


Wei Kang, Ph.D. 
Thesis Co-Advisor 


Samuel E. Buttrey, Ph.D. 
Second Reader 


Patricia Jacobs, Ph.D. 
Chair, Department of Operations Research 


Craig Rasmussen, Ph.D. 
Chair, Department of Applied Mathematics 


ill 


THIS PAGE INTENTIONALLY LEFT BLANK 


iV 


ABSTRACT 


Analysts tasked with developing probability density estimates may obtain data in sets of 
varying quality and quantity. Often low-fidelity data contaminated with measurement er- 
ror, or “noise,” may be abundant, but the cost of obtaining data free of these errors will 
limit the amount of high-fidelity data available. In such a scenario, the problem is to iden- 
tify an estimate of a probability density function given scarce high-fidelity observations, 
knowledge of measurement errors, abundant “noisy” data, and available user knowledge of 
the density apart from empirical data. We solve this rich class of deconvolution problems 
through constrained optimization with first-order epi-splines, which are used for the first 
time to approximate densities to an arbitrarily high level of precision. We limit our scope to 
univariate densities where measurement errors are homoscedastic. Demonstrations come 
from a benchmark from the literature, historical medical data, and a scenario in uncertainty 
quantification in fluid dynamics. Results show that deconvolution via epi-splines is viable, 
comparable with a widely available deconvolution method, and shows potential for savings 


in resource budgets for data collection. 
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Executive Summary 





Analysts faced with scarce budgets for experimentation and probability density 
estimation must also account for the measurement error, sometimes called “noise,” inherent 
in the data they acquire. Additionally, the cost to acquire data and the fidelity of the data 
obtained are often positively correlated. Scientists in the Department of Defense (DOD) use 
simulations to great effect as a resource saving measure, but simulations can vary widely 


in accuracy and computation time. 


Our goal is to generate accurate univariate probability density estimates that can 
blend high-fidelity and low-fidelity data, account for noise, and make use of all available 
information about a given system. With inputs such as prior knowledge of measurement 
error, noisy and accurate data, and knowledge of the density apart from empirical observa- 
tions (1.e., “soft” information), we address this rich class of deconvolution problems using 


constrained optimization with first-order epi-splines. 


Density deconvolution reduces the noise inherent in estimates generated from ob- 
servations contaminated with errors. Epi-splines have been shown to approximate density 
functions to an arbitrarily high level of accuracy. This thesis blends deconvolution and epi- 
splines, and is the first to use first-order epi-splines, which allow for closed form solutions 
of the convolution integral when errors are homoscedastic. First-order epi-splines also pro- 


vide a convenient framework for imposing shape constraints generated by soft information. 


We explore three cases to demonstrate our method. First, we show evidence of 
the accuracy of epi-spline deconvolution starting from a benchmark in the deconvolution 
literature. Second, we show that we can obtain deconvolution results comparable with a 
widely available software package. In an uncertainty quantification example from fluid dy- 
namics, we produce density estimates blending high-fidelity and low-fidelity data showing 
that we can supplement small accurate data sets with abundant noisy data sets to produce 
density estimates comparable with those obtained exclusively by abundant accurate data. 
The result is that we can achieve dramatic reduction in computational expense with po- 
tential for further savings in other applications where the observations in a noisy data set 


contain identical and independently distributed errors. 
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CHAPTER 1: 
Background 





Measurement errors are inherent in all empirical data. When measurement errors are sig- 
nificant, generating estimates based on contaminated data may lead to poor results. This 
measurement error, often referred to as “noise,” may be addressed by the application of a 
technique called deconvolution. Figure 1.1 shows a visualization of deconvolution in the 


broadest sense. 


Figure 1.1: Deconvolution 


Noisy 
Data 


Error 


Information 





The process known as deconvolution produces an esti- 
mate of a true function of interest. Methods and appli- 
cations vary widely. Knowledge of errors is not always 
assumed. 


We demonstrate a new method of deconvolution, that fuses hard data, “soft” in- 
formation, and knowledge of the distribution of noise, to generate a probability density 
function for a random variable of interest. Hard data include noisy (and possibly some 
noise-free) empirical data. “Soft” information is defined as available user-provided sys- 
tem knowledge, such as continuity, monotonicity, tail convexity, or bounds on the density. 
We limit our scope to the estimation of univariate probability density functions where the 
noise is homoscedastic. This process is formulated as a constrained infinite-dimensional 


optimization problem, that is solved through an approximation by means of epi-splines. 


Several distinct data sets provide opportunities for demonstration. 


We continue this chapter with a brief literature review of the topics of decon- 
volution, the additive measurement model, and an introduction to the nascent application 
of epi-spline technology. The modern applications in the field of deconvolution may be 
subdivided, but not limited to, three main categories: signal processing, image processing, 
and probability density estimation. Epi-spline technology has also been used in probability 
density estimation, but this thesis represents the first use of first-order epi-splines. A quali- 
tative overview of these topics is given to the reader in order to understand the significance 
of the various techniques. 


1.1 Deconvolution 
Deconvolution is the process by which we obtain a solution for f in the convolution of f 


and g defined as: _ 
(Fes) =f fe-a)e(e)de (a) 


In Equation (1.1), we say that f and g are convolved. Deconvolution is well-known as an 
ill-posed problem and falls into a larger class of inverse problems. For a tutorial review of 


the deconvolution problem, see [1]. 


1.1.1 Signal Processing 

The application of deconvolution to gain inference on an input signal from one distorted 
by noise or some impulse response has one of its earliest applications in exploratory geo- 
physics. In this field, a shot is fired into the ground and echoes as it passes through sedi- 
mentary layers of the earth. The echoes of the shot are recorded by a receiver and analyzed 


to make predictions about the various underground layers of the earth. 


Enders Robinson’s 1954 dissertation, and related paper, explain the use of time 
series decomposition of wavelets to extract a filtered time signal [2], [3] and expands on 
earlier work in statistics to use Fourier Transform (FT) methods for the analysis of station- 
ary time series. These papers predate the use of the term deconvolution to explain such a 


process of extracting a time series signal filtered of errors, but they serve as an excellent 


introduction to the application of deconvolution in geophysics. Further work on deconvo- 


lution in geophyics is given in [4]-[8]. 


Deconvolution in signal processing is not limited to geophysics. Since the intent 
is to extract a more accurate time series signal from a signal convolved with noise, the 
applications are vast. A seminal paper on the use of Fourier techniques in deconvolution 
for spectroscopy is given in [9]. Further treatment on deconvolution in spectroscopy may 
be found in [10]. Deconvolution has also been used widely in pharmacology to understand 
the effects of various substances on test subjects [11], [12]. Whereas in geophysics, a shot 
may be fired into the ground, in pharmacokinetics, a signal is generated by the intake of a 


certain drug. 


1.1.2 Blind Deconvolution 


Deconvolution refers to the process of recovering an input function or impulse function 
given a known output. Deconvolution may be subdivided into two different kinds of prob- 
lems: one where the input function is accessible in some form, and one where it is not [13]. 
The scenario in which the input is unknown is known as blind deconvolution. Blind decon- 
volution presents a much more challenging problem than deconvolution in its traditional 


form since the input is unknown. 


The adjective “blind” in the term blind deconvolution refers to the unknown na- 
ture of the input signal. One major application of blind deconvolution algorithms is in 
image processing. In this field, an image may be blurred noise taken to be from a point 
spread function (i.e., distortion with a mathematical description). To compute the true im- 
age, it is necessary to estimate the point spread function. Once an estimate is achieved, 
an improved picture follows. The widespread proliferation of digital photography has sig- 
nificantly raised the visibility of this field [14]. A thorough review of blind deconvolution 
techniques for imagery is given in [14]. Helpful visualizations of the performance of blind 


deconvolution techniques are given in [15]. 


1.1.3. Density Deconvolution 

Density deconvolution refers to the application of deconvolution techniques in order to per- 
form inference on a density of an input random variable given observations of an output 
variable. Density deconvolution is more restrictive than other deconvolution methods; the 
estimate of the density function of interest produced by the deconvolution method must ad- 
here to axioms of probability such as nonnegativity and integration to one. A deconvolution 
method that uses FT may be highly accurate over large portions of the domain, but violate 
nonnegativity in the tails. The density estimate must then be amended in order to be an 
actual probability density function. In Chapter 2 we demonstrate how an epi-spline frame- 
work can use constraints to ensure whatever estimate is found via deconvolution meets the 


requirements of a density function. 


1.2 Additive Measurement Model 


Nonparametric deconvolution problems in statistics often involve density estimation of un- 


observed variables in measurement models of the form 
Y=X+e (1.2) 


where the problem is to identify the density of X given n independent and identically dis- 
tributed (iid) observations of Y. In this case, each observation of X and € are unknown, 
but the density of the errors € may be assumed based on prior knowledge. Such a model 
is often described as the additive measurement model [16]. Within such a framework the 
literature is often focused on the nature of the error density [17]-[19], as well as rates of 
convergence [20]-[22]. For a detailed tutorial on density deconvolution; see [16]. Fora 


comprehensive treatment of measurement error; see [23]. 


The R [24] package decon contains an example FT and kernel methods used in 
deconvolution [25]. The supporting literature provides a brief survey of fields in which 
measurement error may be significant, including medicine, bioinformatics, chemistry, as- 
tronomy, and econometrics, as well as an extensive review of kernel based methods and 
bandwidth selection methods for several cases. The authors compute density estimates 


using the Fast Fourier Transform (FFT) instead of the definitions provided and vastly re- 


duce their computation time. Functions included in the R package decon can handle both 


Gaussian and Laplacian errors, as well as homoscedastic and heteroscedastic errors [25]. 


1.3. Epi-Splines 


Royset and Wets have addressed the problem of “how to identify a function that best repre- 
sents data and also satisfies information-driven constraints” [26] with the use of epi-splines. 
These epi-splines are piecewise polynomial functions described by a finite number of pa- 
rameters [27]. By partitioning the domain of a function into a finite number of segments, 
epi-splines allow for the estimation of a function by selecting certain epi-parameters that 
define the estimate of the function in every segment. A solution of a functional identifica- 
tion problem generated by epi-splines will contain parameters such as slopes and intercepts 
in a first-order epi-spline setup or coefficients of polynomials more generally. Epi-splines 
have been used for a growing variety of applications, including financial tools, electricity 


demand forecasts, and probability density estimation [26]-[30]. 


This thesis shows the first computational use of first-order epi-splines. It also 
expands upon existing methodology with the addition of a convolution constraint in order 
to estimate the density of the input variable, X. First-order epi-splines allow for closed- 
form expressions of the convolution integral. These expressions preclude the necessity of 
numerical integration and the approximations and increases in computational cost that are 
unavoidable with numerical methods. The use of convolution as a constraint is explained 
fully in Section 2.3.3. 


By combining an objective function such as maximum likelihood with constraints 
for convolution and soft information, we can deconvolve a probability density function 
generated from noisy data. At a minimum, the estimate of the probability density function 
must be entirely nonnegative and integrate to one. Additional user information may include 
support, continuity, differentiability, convex tails, among other characteristics. We explain 


this process completely in Chapter 2. 


1.4 Overview 

This thesis shows that constrained optimization using epi-spline technology may be a vi- 
able alternative for estimating the density of X in the additive measurement model of Equa- 
tion (1.2) with Gaussian noise. One benefit of this approach is that we can use a sequence 
of constrained optimization models to combine information gained from both an abundant 


and noisy dataset and a scarce but accurate dataset. 


We make our methodology explicit in Chapter 2, detailing steps for data pro- 
cessing, identifying objective functions, and listing all possible constraints of a given con- 
strained optimization problem. In Chapter 3, we apply this method to an artificial data set 
of the convolution of a gamma density and Gaussian noise. Next, in Chapter 4, we compare 
our method with a widely available method on a medical data set of systolic blood pressure 
measurements. Finally, we show how this method may be used to fuse data from high- 
fidelity and low-fidelity simulations in an uncertainty quantification (UQ) scenario from 
fluid dynamics in Chapter 5. We conclude our findings in Chapter 6. 





CHAPTER 2: 
Methodology 





2.1 Constrained Optimization Problems 


We begin from the density estimation method of [27]. We consider an iid sample See Le 


of a random variable X, and a set F of lower semicontinuous (lsc) functions that are non- 
negative, integrate to one, and satisfy soft information about the density. The maximum 
log-likelihood problem (MLP) becomes 


MLP: max y log(f(x')) s.t. fe F (2.1) 
i=1 


where f satisfies additional criteria such as continuity, unimodality, tail convexity, etc., as 


well as a convolution constraint. The maximum entropy problem (MEP) becomes 
MEP: max— f f(x) log(f(x))dx s.t. f € F. (2.2) 


Given the additive measurement model shown in Equation (1.2), we denote the true density 
by fx, a contaminated density by fy, and density of homoscedastic measurement error, 
by fe. The goal is to deconvolve the true density fy from the contaminated density fy 
given knowledge of f- and soft information. We give a visual depiction of this process in 


Figure 2.1. 


2.2 First-Order Epi-Splines 


Following the setup in [26], we define first-order epi-splines. Given a closed interval [/, u] 











in R, we impose an evenly spaced mesh m = {m, | k =0,1,...N} where mp = / and my = u. 





An epi-spline density f is given with epi-parameters slope a* and intercept ak as 


f(x) =ab+akx for x € (mp1, mg) (2.3) 


and for every x € |/,u], has 
f (x) = min{ tim fet) Him Fox—0) (2.4) 


The second condition ensures that the value f(x) is the smaller value of the left and right 











limits of the density at x. Figure 2.2 shows an epi-spline of order one on R, that is piecewise 





affine and allows for jumps at mesh points [26]. 


Epi-splines are dense in the space of Isc functions under the epi-topology [27]. 
Thus, if the mesh is sufficiently fine, a first-order epi-spline can approximate to an arbi- 
trary level of accuracy any Isc function [26]. Constrained optimization via epi-splines is 


therefore appropriate for estimation. 


Figure 2.1: Deconvolution with Epi-Splines 
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Producing a deconvolved density fy requires knowledge of a prior noisy density, fy, soft 
information and knowledge of measurement error density. Observations of X are not nec- 
essary for estimating fy. They may be used if available. 


2.2.1 First-Order Epi-Spline Optimization Problems 


For computational purposes, we employ epi-splines to find approximate solutions of the 


!..,x”, we use a max- 


R2N 


given optimization problems. In the presence of n observations x 











imum log-likelihood objective function to obtain a density fit. Given a set A C 





of epi-parameters that ensure a resulting epi-spline density is nonnegative, integrates to 


one, and satisfies soft information, the first-order epi-spline maximum log-likelihood prob- 
lem (MLP-E) becomes 














MLP-E: max "log (qf Ki 4 ghz!) s.t.aeA CR, where k; s.t. x! € (mp_1,mx) (2.5) 


akak i=1 


and the solution of epi-parameters a = (a), a a, eer aby , a’) satisfies additional criteria 
such as constraints described in Section 2.3. We have an objective function that is concave 
in the epi-parameter. By the inclusion of the logarithm operator, the domain of the objective 


function is restricted to strictly positive values of the density at any interval. 


When observations of X are not available, we can use a maximum entropy for- 


mulation. The first-order epi-spline maximum entropy problem (MEP-E) takes the form 














MEP-E: max— Pf (ak +a*x) log(a +a‘x)dx s.t.a€ ACR. (2.6) 
apa M1 
Since the entropy is composed with an affine function, and the density is entirely nonnega- 


tive, we have a concave objective function in the epi-parameter. 


The density estimate solution of MLP-E or MEP-E, that we denote by fx, is then 
found by optimizing over 2N parameters. Though we focus on the fy estimation problem 
here, we estimate fy in a similar manner, but without a convolution constraint or knowledge 


of measurement error. 


2.3. Constraints 


Detailed formula for our constraints are given in this section. Integration to one and non- 
negativity are required for all densities and are described in 2.3.1 and 2.3.2. Convolution, 


described in 2.3.3, is required for the estimate of fx only. 


Figure 2.2: First-Order Epi-Splines 
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We show a mesh and epi-spline on R. The epi-spline is defined 
by the epi-parameters slope and intercept on each interval. The 
value f(x) is the smaller of the left and right limits at each mesh 
point. 





2.3.1 Integrate to One 
Densities must integrate to one. With a first-order epi-spline solution this integration takes 


the following form: 


y / ab +akxdx = 1 (2.7) 
k=1 7 Mk-1 
It follows that: 
N ; Nk ; ; 
Yi apg — me_—1) + YY (my — mg_ 1) = 1 (2.8) 
k=1 k=1 2 


is affine in the epi-parameter. 


2.3.2 Nonnegativity 
By constraining the function to be nonnegative at the endpoints of each interval, we ensure 


nonnegativity throughout the domain of the function. With the following pair of inequalities 
ak+akmp_1 >0 Vk (2.9) 
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ajtakm>0 Vk (2.10) 


we obtain halfspaces with respect to the epi-parameters. 


2.3.3 Convolution 
Recall that in the additive measurement model, the density of the noisy variable, fy, is the 
convolution of the density of the true variable, fy, and the measurement error density, fe. 


Given a noisy density, fy, and knowledge of fe, we write the convolution equation as 
fry) = f felr—a)fx)de vy. 2.11) 


Having already computed an estimate fy, we can work to identify an estimate fy such 
that fy =i fx * fe). We consider cases where the noise is Gaussian with known mean and 
variance, {1 and 0%, respectively, and fy is approximated via first-order epi-splines. For 
computational purposes, we ensure that the right hand side of Equation (2.11) is close to 


fy with a tolerance 6 and the inequality 








N Mm 1 ( 2 2 
A —(y-x—p) [20 kjk 
RY) i) e (ak +akx,)dx | <6 (2.12) 
py m1 OV 20 

for a finite number of y values evenly spaced within the interval |[mo,my]. With Gaussian 
measurement errors, we can obtain the closed form solution of these integrals. A nonzero 
6 is meaningful since we are working with estimates fr, fy, and assumed knowledge of 
error fg, and there is uncertainty and possible errors in all three of these densities. 


2.3.4 Density Continuity 


Continuity may be required by the user. To ensure continuity, we set 


k+1 


ah +a‘m, =d taktlm 


yet We=1js3N=—1 13) 


which is affine with respect to the epi-parameters. 
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2.3.5 Non-increasing or Non-decreasing 


Soft information may determine that the density of interest is nonincreasing. Since the 
density is not necessarily continuous, we use two constraints to establish a non-increasing 
function. First, we must have nonpositive derivatives, and secondly, we ensure that the 
function evaluated at the upper end of each interval is greater than or equal to the func- 
tion evaluated at the lower end of the subsequent interval. We can use the following two 
inequalities. 

ak <0 Vk (2.14) 


ah +akm > a tat, Vk (2.15) 


These result in halfspaces with respect to the epi-parameters. We reverse the inequalities 


for non-decreasing estimates. 


2.3.6 Unimodal 


If soft information indicates the density is unimodal, the user can specify an interval of 
inflection points that contains the mode. We can achieve a unimodal estimate by requiring 
the function to be nondecreasing to the left of the left inflection point, J, nonincreasing to 
the right of the right inflection point Jy, and requiring nonincreasing derivatives between the 


inflection points. With first-order epi-splines, five constraints are required for unimodality. 


a®>0 for all k with my < I, (2.16) 

ay taking < apt! +a‘*!m, for all k with my < Ir (2.17) 
a’ >a‘! for all k with my > I, and my_1 < Iy (2.18) 
a’ <0 forall k with m_, > Iy (2.19) 

ah taking > apt! +a**!m, for all k with my_1 > Iy (2.20) 


We find that the unimodal constraints are affine. 
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2.3.7 Tail Convexity 
We can impose tail convexity for a function by ensuring that consecutive slopes are in- 
creasing outside of the inflection points. We introduce the following constraints enforced 
in addition to continuity: 

ak <a! Vk with m < Ir (2.21) 


ak <a! Yk with mpe_1 > Iv (2.22) 


to ensure convexity in the tails. 


2.3.8 Maximum Change in Gradient 
Since we have a first-order epi-spline framework, this epi-spline estimate is not differen- 
tiable. We can incorporate a certain “smoothness” by introducing a maximum change in 
slope, A, where 

lata") <A Ve=1,...,.N-1. (2.23) 


2.4 Comparisons 
We describe the method by which we measure the accuracy of our estimates numerically. 


When the true density is known a mean squared error (MSE) is computable as 


b 
MSE = | (fie (x) — fi (x))2 fix (x). (2.24) 
which we integrate numerically via the midpoint rule. 


For visual comparisons, we calculate kernel smoothing density estimates on cer- 
tain data. Kernel estimates are generated using the standard kernel density estimator in 
MATLAB [31], a Gaussian kernel, and the default bandwidth calculation. 


2.5 Procedures 
We describe procedures for obtaining density estimates. Given an iid sample of Y, we 


obtain a solution to MLP-E, fy, imposing constraints according to soft information. From 
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this solution, we record values of fy at evenly spaced intervals for use in the convolution 


constraint. 


If an iid sample of X is available, we can obtain a solution of MLP-E given soft 
information and knowledge of measurement error. If no samples of X are available, we 
can obtain fy as the solution to MEP-E. In addition to adhering to constraints imposed 
by soft information, MLP-E and MEP-E is constrained in estimates of fy by convolution 
as described in Section 2.3.3. Chapter 3 and subsequent chapters will demonstrate how 


increasing soft information can improve estimates. 
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CHAPTER 3: 


Gamma-Benchmark 





3.1 Setup 
We demonstrate the effectiveness of our model in simulations motivated by test instances 


in [32]. We consider an additive measurement model scenario where measurement errors 


have a normal density, fe, and fx is a Gamma density. The convolution of these densities 


is fy. 


3.1.1 Data and Noise 


We generate 5000 samples of a random variable from a Gamma density with a certain 


shape, a, and scale, B, so that 


X ~ Gamma(a =5,B =1) (3.1) 
and 5000 measurement errors 
€ ~ Normal(u = 0,0 = V3.2). (3.2) 


Thus we record 5000 observations of Y with each observation being the sum of a single 
observation of X and a single observation of €. A histogram of the observations of Y is 
given in Figure 3.1 and summary statistics for these observations is given in Table 3.1. We 
retain no knowledge of X or € observations for density estimation but provide histograms 
of X and € for illustrative purposes only in Figure 3.1. Separately, three observations of X 


are available for density estimation. 


Table 3.1: Summary Statistics of Y Observations 





Minimum | Ist Quartile | Median | Mean | 3rd Quartile | Maximum 
—2.832 2.98 4.784 | 4.972 6.731 21.640 
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Figure 3.1: Sums of Gamma and Gaussian Observations 
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Observations 


We generate 5000 observations of Y from the sums of 5000 samples of X and 
5000 samples of €. The blue histogram shows the frequency of Y and is filled to 
indicate the user’s knowledge of these observations. The red and black outlined 
histograms show the frequency of the X and € observations, respectively, that are 
unknown to the user. 


3.1.2 Soft Information 


Several items of soft information are available. Some settings we use for density estimation 
are the same for all the examples in this chapter and are shown in Table 3.2. The soft 


information we introduce incrementally is given in Table 3.3. 
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Table 3.2: Settings for Estimates of Artificial Data 





Settings 
Mesh Cardinality, NV = 1000 
Support, [m9, my]: [0,23] 
Convolution Tolerance, 6 = 5 x 1073 
Convolution Values Checked: 101 














Table 3.3: Soft Information on Density for Epi-Spline Estimates 





Soft Information 
Continuous 
Unimodal 
Inflection Points, [I,,Jy]: [Y — 6sg,Y + 6se] 
Convex Tails 
Maximum Change in Gradient, A = .00125 


Y and Gs, represent the sample mean and estimated standard error. 














3.2 Objective Criteria 
The following sections document the performance of objective functions described in Sec- 
tion 2.3 in estimating fy. We use MSE as a metric for accuracy as described in Section 2.4. 


To verify our results, we have run a simulation experiment detailed in Section 3.3. 


3.2.1 Minimum Convolution Tolerance 
Convolution is a constraint in MLP-E and MEP-E, we can also minimize the tolerance 6. 


This problem takes the form 











mind s.t.aeA CR (3.3) 


ak ak 





and the terms of the problem follow from Chapter 2. With extremely small values of 6 
or very large numbers of y values checked, MLP-E and MEP-E become infeasible. By 
relaxing the convolution tolerance slightly, we can incorporate the solution of this problem 
as a parameter in MLP-E or MEP-E. 


3.2.2 Maximum Likelihood Estimation 
We describe the results of MLP-E for the estimate of the density fy and denote an optimal 
solution to this problem as fr. 
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We begin with the settings given in Table 3.2 and impose the continuity con- 
straint. The result, fx, appears as a red line in Figure 3.2 and contains several distinct 
“spikes.” Given the blue histogram shown in Figure 3.1, the user may reasonably believe 
that the underlying density is unimodal, with estimates for inflection points of approxi- 
mately one sample standard deviation of Y from the sample mean, Y. When we impose this 


constraint, the result improves dramatically as shown in Figure 3.2. 


Figure 3.2: Continuous and Unimodal Maximum Likelihood Estimates 
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The blue line shows a solution of MLP-E for 5000 observations of Y. The dashed line 
shows the density of the Gaussian errors. The black line shows the density calculated 
from kernel methods using three observations of X. The dashed green line shows the true 
Gamma density. The red line shows the solution of MLP-E given three observations of X 
and certain soft information. We begin with continuity on the left and impose the unimodal 
constraint on the right. With continuity, MSE = 6.765 x 10~!. The addition of the unimodal 
constraint improves the estimate dramatically; MSE = 9.3779 x 107+. 


The visual "staircase" effect in Figure 3.2 may incline the user to impose convex- 
ity on the tails. We obtain the result of imposing this additional constraint in Figure 3.3, 
which still contains a sharp peak caused by a lack of bounds on gradient changes. By 1m- 
posing such a bound, or a certain “smoothness,” in addition to previously given constraints, 
fe improves again. We document the numeric results of this evolving soft information in 
terms of MSE in Table 3.4. 


To demonstrate the impact of the convolution constraint, we compare the same fx 


of Figure 3.3 with an alternative produced without the convolution constraint. We show this 
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comparison in Figure 3.4. With convolution imposed as a constraint, the red line represent- 
ing fx has an extended right tail. Without the convolution constraint, the density appears 
nearly symmetric, producing a higher density estimate near the mode and eliminating the 
extended right tail. We see that the convolution constraint is key for identifying extended 


tail behavior. 


Figure 3.3: Tail Convex and “Smooth” Maximum Likelihood Estimates 
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The blue line shows a solution of MLP-E for 5000 observations of Y. The dashed line 
shows the density of the Gaussian errors. The black line shows the density calculated from 
kernel methods using three samples of X, shown as green circles. The dashed green line 
shows the true Gamma density. The red line shows the solution of MLP-E given three 
samples of X and certain soft information. On the left, we have a density estimate that is 
continuous, and unimodal with convex tails. MSE = 7.3031 x 10~*. On the right, we add 
“smoothness.” MSE = 1.7953 x 1074. 


3.2.3. Maximum Entropy Estimation 

Without samples of X, we can use MEP-E, ensuring “maximum ignorance” in density 
estimation. An optimal solution of MEP-E is the most dispersed density possible given fy 
(i.e., an optimal solution of MLP-E for 5000 observations of Y), f¢, and soft information. 
The orange line in Figure 3.5 shows the result of MEP-E. The dispersion of the density is 


evident in the right tail, which becomes apparently uniform and remains nonzero. 


The advantage of such a method is clear in non-artificial examples when no high- 
fidelity data is available. With artificial data we have the ability to generate samples of X 
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and therefore can compare the results of MLP-E, MEP-E, and minimum 6 for the same 
data and soft information. Table 3.5 shows a comparison of MSE and Figure 3.5 shows 
a graphical comparison. In a single trial, MEP-E produces the lowest MSE, but we offer 


repeated trials in Section 3.3 for further analysis. 


Table 3.4: Error Reduction Due to Increasing Soft Information 








Soft Information MSE 
Nonnegative Support 9.18 x 10°? 
Nonnegative Support, Continuous 6.765 x 107! 
Nonnegative Support, Continuous, Unimodal 9.3779 x 10-4 
Nonnegative Support, Continuous, Unimodal, Convex Tails 7.3031 x 10-4 
Nonnegative Support, Continuous, Unimodal, Convex Tails, “Smoothness” || 1.7953 x fo-* 

















The results of MLP-E show how increasing soft information reduces MSE, with the 
largest reduction achieved through the addition of the unimodal constraint. 


3.3 Numerical Experiments 

We show through repeated trials that our results do not appear by chance. We generate 30 
pairs of Y and X data sets (i.e., 5000 Y observations and three X observations), and obtain 
a solution of MLP-E for each pair. Separately we generate 30 sets of 5000 Y observations 
and obtain a solution fy of MEP-E for each trial. The data produced for each replication 
of each experiment is unique. We record the MSE of each trial and compute averages of 
MSE for both sets of replications. More detailed pseudocode describing the execution of 
these replications is available within Algorithm 1 of the Appendix. For all trials in these 


experiments we include all the soft information given in Table 3.3. 


We present our results in Figure 3.6. Consistent differences between the two 
methods become apparent in the right tails. Solutions of MEP-E often remain nonzero 
and become apparently uniform in the right tail. Solutions obtained via MLP-E exhibit 
more accurate right tail behavior. Soft information ensures that mean MSE for MLP-E and 
MEP-E do not vary widely on average. 
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Figure 3.4: The Effect of the Convolution Constraint 
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The blue line shows a solution of MLP-E for 5000 observations of Y. The dashed line shows 
the density of the Gaussian errors. The black line shows the density calculated from kernel 
methods using three samples of X, shown as green circles. The dashed green line shows 
the true Gamma density. The red line shows the solution of MLP-E given three samples 
of X and certain soft information. On the left, we include the convolution constraint, but 
on the right, we do not. By removing this constraint, we see that convolution is the key to 
right identifying tail behavior. 


Table 3.5: Error Results for Differing Criteria 








Problem MSE 

MLP-E 1.7953 x 10-4 

MEP-E 4.1694 x 1075 
Minimum Convolution Tolerance || 1.2398 x 10-4 

















MSE is shown for the same data and soft information. 
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Figure 3.5: Comparison of Objective Functions 
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The black dashed line shows the true density. The blue line shows a solution of MLP-E 
given three samples of X, shown as green circles. The green line shows a solution found 
by minimizing the convolution tolerance, 6. The orange line is a solution of MEP-E. Since 
the observations of X near or below the mode, MLP-E gives the lowest density in the right 
tails. The density is most dispersed throughout its support in MEP-E, which creates a near 
uniform nonzero right tail. Minimizing convolution tolerance underestimates the density at 
the mode. 
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Figure 3.6: Repeated Trials of Epi-Spline Estimates 
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The solid black lines show solutions of MLP-E for three random samples of X in (a), and 
solutions of MEP-E in (b). The dashed lines are solutions of MLP-E for 5000 observations 
of Y. The red lines show the true density. Each trial is performed on distinct data sets. 
Repeated trials of MLP-E and MEP-E show slight differences in right tail behavior. With 
MLP-E, right tails approach zero, while with MEP-E right tails often become nearly uni- 
form and nonzero. In (a), average MSE = 9.4 x 10->. In (b), average MSE = 1.14 x 10-4. 
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CHAPTER 4: 
Biostatistics Data 





We now turn to the application of MEP-E on a medical data set and a comparison with 
another deconvolution method. Density deconvolution literature is abundant within the 
field of biostatistics. Through the R package decon, we have obtained real data on systolic 
blood pressure (SBP) from the Framingham Heart Study (FHS) described in [23]. The data 


contain homoscedastic measurement errors. 


4.1 Deconvolution with R Package decon 

The authors of [25] provide explicit instructions to use their deconvolution method for 
density estimation. They use FHS data from 1,615 male subjects whose SBP is measured 
and recorded twice at a first visit and then again measured and recorded twice at a second 


visit eight years later. 


They compare the difference of the average SBP measurement at each visit for 
each subject and each difference is displayed in a histogram containing 1,615 differences 
within Figure 4.1. Measurement errors are assumed to be normally distributed based on 
the appearance of this histogram; therefore the differences are used to compute the stan- 
dard deviation of measurement errors. The assumption of normality is justified further 
within [23]. 


Their goal is to produce an accurate density of average SBP at the second visit 
by accounting for measurement error. The authors perform the deconvolution via a FFT 


algorithm, and their results on the SBP data are shown in Figure 4.1. 


4.2 Deconvolution of Systolic Blood Pressure Data Using 
Epi-Splines 
Our goal is to compare the results of MEP-E with the results achieved with decon. By 


incorporating estimated measurement error, the empirical data, and soft information, we 
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can achieve competitive results. The intent is to show that epi-spline estimates are capable 


of deconvolution in a non-artificial example. 


Figure 4.1: The R package decon Results 
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On the left, we see the histogram of the differences of the average SBP reading at each visit. 
On the right, we see the results of deconvolution via decon. The blue dashed line shows a 
kernel smoothing density and the black solid line shows the deconvolved estimated density. 


4.2.1 Setup 


To begin, we relate the terms described in [25] to the additive measurement model of Equa- 


tion (1.2). In decon, the numeric vectors containing each subject’s average SBP reading 


at the first and second visit are defined as SBP1 and SBP2, respectively. Since the goal 


is to deconvolve SBP2, we describe each subject’s second visit average SBP reading as 


an observation of Y. Therefore, we have 1,615 observations of Y, and we show summary 


statistics in Table 4. 


1. 


Table 4.1: Summary Statistics of SBP2 Observations 





Minimum 
87.5 





1st Quartile 
117.0 








Median 
126.5 





Mean 
130.0 





3rd Quartile 
139.5 


Maximum 
263.0 











No observations of X are available. Following the assumptions in [25] the esti- 
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mate of measurement error is given as 





€ ~ Normal(u = 0,0 = \/.5var(SBP1 — SBP2). (4.1) 


We first obtain a solution fy of MLP-E given 1,615 observations of Y. Second, 
we use fy and knowledge of measurement error to obtain an optimal solution fy of MEP-E. 
Table 4.2 provides the settings that remain unchanged for all estimates and Table 4.3 in- 


cludes the soft information we impose incrementally. 


Table 4.2: Settings for Estimates of SBP 





Settings 
Mesh Cardinality, NV = 1000 
Support, |779, my]: [80,265] 
Convolution Tolerance, 6 = 2 x 1073 
Convolution Values Checked: 101 














Table 4.3: Soft Information for Systolic Blood Pressure Density 





Soft Information 
Continuous 
Unimodal 
Inflection Points, [J7, Jy]: [110, 140) 
Convex Tails 
Maximum Change in Gradient, A = 3.75 x 10-5 














4.2.2 Evolving Soft Information 
We impose soft information in a sequence similar to Chapter 3. In all figures within this 
chapter, the blue lines show optimal solutions of MLP-E of the given noisy SBP data and 


the red lines show the deconvolved estimate from optimal solutions of MEP-E. 


We show the change in density estimates as we increase the soft information 
included in our estimate from continuity to unimodality in Figure 4.2. In a continuous es- 
timate, we see how MEP-E disperses the density with peaks at the lower and upper bounds 
of the support. Again, the effect of imposing the unimodal constraint is dramatic. The 


unimodal estimate reduces the density in the tails and increases the density at the mode, 
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but produces a “staircase” effect and a sharp peak. As shown in Chapter 3, we can elimi- 
nate these effects with the constraints of tail convexity and maximum changes in gradient, 


which we denote by A. We include all soft information in Section 4.3. 


Figure 4.2: Evolving Soft Information for Systolic Blood Pressure Density 
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The blue line is the estimate of the density of SBP readings by from the solution of MLP-E. 
The red line is the deconvolved estimate given by the solution of MEP-E. We begin with 
continuity on the left and add the unimodal constraint on the right. The sharp peaks of both 
estimates show the necessity of imposing a maximum change in gradient. 


4.3, Comparison of Deconvolution Methods 
With all soft information, the deconvolved epi-spline density is shown in Figure 4.3. Here 
we also show the kernel density with a black solid line and the deconvolved estimate from 
decon with a black dashed line. We note that decon computes a certain kernel bandwidth 
adjustment [25], so that we focus on the differences in the effects of deconvolution by each 
method, rather than the differences between corresponding densities. 


Both decon and MEP-E reduce the density in the tails and increase the density in 
the mode for exactly the same data. These changes in density are precisely those noted in 
the explanation of the R package decon in [25]. We see that deconvolution via epi-splines 


produces remarkably similar effects. 
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Figure 4.3: Deconvolution Method Comparison 
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Systolic Blood Pressure 
A comparison of our methodology on Systolic Blood Pressure data against a FFT method 
in the R package decon is shown. We compare the difference between the black lines from 
the decon package and the difference between the epi-spline estimates in blue and red. 
Both deconvolved estimates reduce density in tails and increase density at the mode. 
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CHAPTER 5: 
High-Fidelity and Low-Fidelity Simulation Output 





5.1 Hydrofoil Concept 

In this chapter we examine an example in UQ from fluid dynamics. We have received a 
data set from Dr. S. Brizzolara of the MIT SeaGrant program that contains the output of 
high-fidelity and low-fidelity fluid dynamics simulations for a hydrofoil concept displayed 
in Figure 5.1. The goal is to produce a method that can reduce the number of high-fidelity 
simulations required to produce an estimate of the density of high-fidelity performance 
by supplementing a few observations from a high-fidelity data set with many low-fidelity 


simulations. 


Figure 5.1: Hydrofoil Design Concept 





This image represents a design under development for an advanced high-speed hydrofoil. 
Fluid Dynamics simulations to compute characteristics such as Drag/Lift vary widely in 
fidelity and computational cost. The image and data provided for this experiment come 
courtesy of Dr. S. Brizzolara of the MIT SeaGrant program. 
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5.2 Computation Time 

A discussion of the difference in computation time is helpful to understand the desire to 
reduce the number of high-fidelity observations. The simulation calculates the drag/lift 
coefficient (D/L) using two differing methods. The high-fidelity method solves using the 
Navier-Stokes equations requiring four hours of computation time on eight cores for a 
single solution. By contrast, the low-fidelity solve uses the panel method, which only 
takes five seconds on one core. The data contain 898 pairs of high-fidelity and low-fidelity 
observations of D/L. We compare the data by fidelity in Figure 5.2. 


Figure 5.2: Comparison of High-Fidelity and Low-Fidelity Simulation Data 
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5.3 Comparison of Output Data 

Figure 5.2 shows that there appear to be two underlying relationships between high-fidelity 
and low-fidelity data. For much of the data, high-fidelity and low-fidelity results are clearly 
positively correlated, but many other points show that high-fidelity D/L is largely un- 
changed as low-fidelity D/L is increased. We refer loosely to these two patterns as hor- 


izontal and diagonal portions. 


We show kernel smoothing density estimates of high-fidelity data and low-fidelity 
data in Figure 5.3, computed in a manner described in Section 2.4. The low-fidelity kernel 
density is unimodal, but the high-fidelity kernel density is bimodal. Our goal is to identify 
the salient characteristics of the high-fidelity density by fusing together few high-fidelity 
samples and many low-fidelity samples in the following process, described in further detail 
in Algorithm 2 within the Appendix. 


5.4 Regression on a Sample for Deconvolution 
To relate this UQ scenario to Equation (1.2), we consider high-fidelity Drag/Lift as an 
observation of X and a low-fidelity Drag/Lift as an observation of Y. Using regression 
we may describe high-fidelity estimates in terms of low-fidelity data. We adopt the linear 
regression in the form 

X =Po+BY+e (5.1) 


and since errors are symmetric, we can write 
Bo + BY =X +e. (5.2) 
We use an affine transformation with the coefficients of regression to obtain a new variable 
Y’ = Bo + BY (5.3) 


so that 
Y'’=X+e. (5.4) 


Thus we can use observations of the variable Y’ to obtain a solution fy, of MLP-E to work 


towards a deconvolved solution. 
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Figure 5.3: Estimates Obtained by Kernel Smoothing 
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Kernel smoothing estimates show the disparity between results of sim- 
ulations of differing levels of fidelity. Red points indicate low-fidelity 
points, and the dashed-dotted line gives an estimated density given these 
low-fidelity points. Green points indicate high-fidelity points, and the 
solid line gives an estimated density given the high-fidelity points. Notice 
that the solid line is bimodal. High-fidelity outliers are also displayed. 


We may perform a single regression, naively, on a sample of the data to generate 
an estimate of measurement error. That is, from the regression line shown in Figure 5.4, we 
obtain a residual standard error Ges that we use as the estimate of the standard deviation 
of measurement errors. With the coefficients shown in Table 5.1 we transform 700 low- 
fidelity observations of Y into observations of Y’, shown by the red points in Figure 5.4. 
Additionally, we obtain 10 samples of high-fidelity data, shown by the green points, and 
obtain a solution fy of MLP-E, shown by the red line. We compare the result to a kernel 
smoothing estimate achieved using 898 samples of high-fidelity data. The estimate fy fails 
to capture the bimodal behavior of the density. A more sophisticated approach is required 


to obtain a better estimate. 
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Figure 5.4: Regression on a Sample for Deconvolution 
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On the left, we show a single linear regression on a sample of 100 points. On the right, 
we show the kernel smoothing estimated density of 898 high-fidelity points with the black 
line. The red line is the deconvolved solution of MLP-E given 700 transformed low-fidelity 
points 10 high-fidelity points. Performance of the estimate is significantly altered without 
partitioning of the subsample for multiple univariate regressions. The deconvolution causes 
the mode of the epi-spline density to fall to the left of the peak kernel smoothing density. 


Table 5.1: Single Regression Data 





Entire Sample 
Intercept, Bo: 7.982 x 10-7 
Slope, B: 9.473 x 107! 
Residual Standard Error, Gps: 2.604 x 10~+ 














5.5 Regressions on Partitions of a Sample for Deconvolu- 
tion 
Figure 5.5 shows a sample may preserve the horizontal and diagonal patterns discussed 
in Section 5.3. Though distinct patterns appear, the 14 points that fall in the intersection 
of these patterns present a certain ambiguity. We may assume that all points for which 
the high-fidelity D/L falls within a lower bound .0853 and upper bound .0857 belong to a 
horizontal portion of the data and the rest in the diagonal pattern. Alternatively, we may 
assume that the points that may be described as ambiguous (i.e., those that fall within the 
horizontal and diagonal patterns), should be distinct from those clearly in the horizontal 


pattern. 
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We record a weight p for the number of points in the horizontal portion divided 
by the sample size of 100. If ambiguous points are grouped with the other horizontal 
points, then 28 points fall in the horizontal pattern of 100 sampled points. Thus, p = .28, 
and p = .14 if the ambiguous points are grouped with the diagonal pattern, since 14 points 
fall in the ambiguous region. We weight density estimates of the horizontal and diagonal 


portions accordingly to produce a mixture. 


Figure 5.5: Regressions on Partitions of a Sample for Deconvolution 
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We sample 100 points without replacement from the data set containing all high-fidelity 
and low-fidelity simulation output. The diagonal and horizontal patterns are clear, but there 
is an ambiguous region. The scatterplot on the left visualizes the partition into horizontal 
and diagonal sections, with a regression over red and black points, and a separate regres- 
sion over the blue points. The bars within the dashed lines of the histogram represent those 
points assumed to be in the horizontal pattern. 


We compute a linear regression for the horizontal pattern and the diagonal pattern 
for each method of partitioning the sample of 100. Two of the four regressions we compute 
appear as two intersecting lines in Figure 5.5 given the partition that groups all the ambigu- 
ous points together with the horizontal points as described in Figure 5.5. Coefficients B 
and fo, and residual standard errors, Grsz, for all regressions are given in Table 5.2 and 
identified by the groups of points included for the particular regression. Within Figure 5.5 
and Table 5.2, the points that clearly fall along a diagonal pattern are described as diagonal 
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points, those clearly along the horizontal pattern are called horizontal points, and the points 


that do not clearly belong to a particular group are called ambiguous. 


Table 5.2: Partition Regressions 














Points Included Intercept, Bo Slope, B ORSE 
Horizontal and Ambiguous |] 8.281 x 10-7 | 3.280 10-7 | 1.114 10-4 
Diagonal 3.588 x 10-7 1.0006 1.819 x 10-4 
Horizontal 8.787 x 10-2 | —2.932 x 10°? | 8.704 x 10-5 
Diagonal and Ambiguous _ |] 2.982 x 1073 1.008 1270 x10" 




















For both partitions, the horizontal component density fn we assume to be 
Normal(ut = Y’,o = 6pse), where Y’ and 6gsz are found through the corresponding regres- 
sion on the horizontal portion. We perform no deconvolution on the horizontal component, 


but we perform deconvolution on the diagonal component in the following manner. 


We obtain 700 observations of Y’ using the coefficients from the diagonal regres- 
sion and a solution fy: of MLP-E. From the remaining 98 high-fidelity and low-fidelity 
pairs of observations, we sample 10 additional high-fidelity observations without replace- 
ment from outside the region containing the horizontal pattern. The Grsg of the diagonal 
regression we use as an estimate of measurement error. For this diagonal portion, with the 
given data and soft information for this portion shown in Table 5.3 we obtain a solution fra 
of MLP-E. The result is shown in Figure 5.6. 


Table 5.3: Settings and Soft Information for Hydrofoil Data 





Estimation Settings 
Mesh Cardinality, NV = 1000 
Support, |779, my]: [.0815, .0885] 
Convolution Tolerance, 6 = 1 x 10~? 
Convolution Values Checked: 11 
Continuous 
Unimodal 
Inflection Points, [Jz,Jy]: [.0845, .0855] 
Convex Tails 














Noticeable differences appear between the two component densities in Figure 5.6. 
The density at the mode for the horizontal portion is far greater than the peak density in the 
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diagonal component. The deconvolution of the diagonal portion, shown by the red line in 
Figure 5.6 reduces the density in the tails and increases the density in the mode, due to the 
observations of high-fidelity data points. 


Figure 5.6: Mixture Density Components 
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The left plot shows the assumed normal density for the horizontal portion. On the right, we 
have deconvolved a density estimate of the diagonal portion using residual standard error 
from regression of the diagonal portion as an estimate of the standard deviation of the noise. 
The dashed line shows the result of MLP-E on 700 low-fidelity observations. The red line 
shows the deconvolved solution of MLP-E on a sample of 10 high-fidelity observations 
displayed as green circles. 


5.6 Mixture Density Estimates 

Recall that in Section 5.4 we recorded a weighing factor, p, with which to weight the 
component densities estimated in Figure 5.6. Now that we have identified the densities 
based on the horizontal and diagonal portions of the data the result is that 


fx =pfxn+ (1 —p) fra: (5.5) 


We plot the mixture density when p = .28 against the kernel smoothing density based ex- 
clusively on high-fidelity data. The result is shown in Figure 5.7. The result is a bimodal 
estimate whose modes closely approximate the modes shown in the kernel smoothing esti- 


mate. 
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In Figure 5.7, we see that the mixture density in red overestimates the higher 
mode significantly and we compare mixtures produced through different partitions in Fig- 
ure 5.8. We see that the density at the higher mode is reduced at a cost of an increased 
density at the lower mode. The sharp peaks in the density may be reduced by imposing 


bounds in change of gradients as shown in Chapter 3 and Chapter 4. 


Figure 5.7: Mixture and Kernel Estimates 
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The dashed line shows the mixture density produced without deconvolution in the diagonal 
portion. The red line shows the deconvolved mixture density. The black line shows the 
kernel smoothing estimate of the high fidelity data. The red samples show observations of 
low-fidelity drag/lift transformed by regression. We see that a deconvolved mixture density 
captures the bimodal behavior of the data, but overestimates the density in the higher mode. 


The difference between the epi-spline mixtures and the kernel smoothing estimate 


must be considered in light of the difference in computation time for producing simulation 
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results. Table 5.4 shows how much computational expense would be required for each 


method of producing an estimate. Given the same number of cores, using an epi-spline 


mixture to acquire a density would allow for a reduction in computation time of over 87%. 


Table 5.4: Computational Expense of Differing Methods 



































Method |} X Count | Cores | Hours |} Y Count | Cores Hours Core-Hours 
Kernel 898 8 4 0 1 1.39 x 1073 28736 
Mixture 110 8 4 898 1 1.39 x 10-3 3521 














An epi-spline mixture using both high-fidelity and low-fidelity data can identify 


the characteristics of a bimodal distribution of high-fidelity data. Additionally, using such 


a mixture technique can save an enormous amount of computational expense when the 


difference in computation time between high and low-fidelity simulations is vast. 
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Figure 5.8: Mixture Estimates Generated From Different Partitions 
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The red line shows the mixture estimate that results from assuming the ambiguous points of 
a sample fall in the horizontal pattern. The blue line shows the mixture estimate that results 
from assuming the ambiguous points of a sample fall in the diagonal pattern. The black 
line shows the kernel smoothing estimate. The red line overestimates the density of the 
higher mode and the blue line overestimates the density of the lower mode. Both capture 
the bimodal behavior of the density. The sharp peaks can be eliminated with bounds on 


changes in gradient. 
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CHAPTER 6: 


Conclusion 





We have presented a new method of deconvolution using first-order epi-splines. To the 
best of the author’s knowledge, no other available deconvolution technique blends multiple 
data sets of varying fidelity and incorporates soft information to produce a single density 
estimate. All information available should be used for uncertainty quantification, and es- 


pecially in deconvolution, since highly accurate data is often extremely scarce. 


Initial results are promising for deconvolution via epi-splines. We show highly 
accurate results in the Gamma example for a single trial with all soft information included. 
Replications show that this level of accuracy is not achieved by chance. The SBP scenario 
gives an example where epi-spline deconvolution can produce meaningful results without 
any high-fidelity data. Our example with the hydrofoil data shows the potential for epi- 
splines to be used in data collection resource decisions. Epi-spline deconvolution provides 
an alternative to analyzing the results of high and low fidelity observations in a compart- 
mentalized fashion. Provided the user can obtain some knowledge of the accuracy of a low 
fidelity data set showing that measurement errors are Gaussian, epi-spline deconvolution 
can blend high and low fidelity data sets together for a viable estimate. Regression can 


serve as a tool for identifying an estimate of measurement error. 


Challenges remain for the development of first-order epi-splines for density es- 
timates. Widely available software makes use of automatic bandwidth selection in kernel 
smoothing density estimation, but we provide no such tool for selecting a hyperparame- 
ter A in the maximum change in gradient constraint described in Section 2.3.8. Similarly, 
the convolution constraint of Section 2.3.3 is well defined, but the tolerance and number 
of low fidelity density values to compare during deconvolution must be set manually. We 
recommend further analysis to obtain a reasonable selection for automatically identifying 
a well set 6 before attempting to expand the scope of epi-spline deconvolution to multi- 
variate densities or non-Gaussian errors. We remain confident that these challenges can be 


overcome, enabling epi-spline deconvolution for further use and widespread distribution. 
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APPENDIX: Computation 





A.1 Convolution Expression 
We can find a closed form solution of the integral in the right hand side of 





i Notte 1 yex—py? / 202 
Ay) = | ad ak + akx,)dx Vy (A.1) 
v4 ) u cs oV2n ( 0 k) 


by separating the expression into slope and intercept components and evaluating the inte- 


gral over every interval. We expand this integral to 


k k 
m atx —(y-x-p)* /20? L a9 —(y-x—p)? /20? 
e dx+ e dx. A.2 
eae Tessa a 


Through a combination of written methods and software such as the MATLAB Symbolic 
Toolbox [31] we find the closed form solutions of each addend in Equation (A.2). The 





slope component becomes 








a( = ( Mg—| p+ yert (V2) + Gert( HA **) 








2 oV2 
oO 7 (—my—p+y)? oO at (—mg_y-u+y)? (A.3) 
= a a e 26 
V2 V2 





y ome ees) y Guess 
+ let = ert 
2 ( oV/2 Z oV/2 
and the intercept component, notably simpler, becomes 
k 
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We use the error function 


etf(x) = = [ om ae (A.5) 
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A.2. Artificial Data 


First, we generate observations of random variables according to their respective densities 
using any capable software program. We store the sum of these two random variables 
in a .csv file which we use as input for a General Algebraic Modeling System (GAMS) 
program. In MLP-E, we produce an X data set. The GAMS program processes the data 
and produces output files which contain optimal solutions. Finally, these files are passed 


back to another program which plots the output and computes MSE. 


Algorithm 1: Replication of Optimized Epi-Spline Estimates 
Result: Mean Epi-Spline Estimate Error 
foreach Trial do 

Generate 5000 sums of X and € observations; 
Record each sum as Y data; 
if Using Maximum Likelihood Estimate then Record 3 new X samples; 
Input and process data into epi-spline estimating program; 
Input soft information for f, and fy; 
Solve for fA using MLP-E; 
Record value of fy) for 101 values of y; 
if X data recorded then 
Solve for f, using MLP-E; 
else 
if Maximum Entropy Desired then 

| Solve for f, usingMEP-E; 
else 

| Solve for ae minimizing convolution tolerance, 6; 
end 
end 
Export iy and f, results to post-processing program; 
Compute MSE; 
end 
Compute mean MSE; 
Display estimates and MSE as required; 
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A.3. Hydrofoil Concept 


We generate a bimodal distribution in the following process. First we sample the data to 
identify patterns and establish a partition of the sample. Two univariate regressions provide 
information as inputs to produce distinct component densities. Abundant low fidelity data 
supplements scarce high-fidelity data for a deconvolved component estimate. The mixture 
density produced is comparable to a kernel smoothing estimate of exclusively high-fidelity 
data. 


Algorithm 2: Developing an Epi-Spline Mixture Density 
Result: Mixture Density Estimate 
Collect 100 pairs of high-fidelity and low fidelity observations; 
Establish interval [a,b] in which horizontal portion falls; 
p = number of points for which x € [a,b]; 
foreach Component c do 
| Calculate Boc, B., and o-; 
end 
Set fy; = Normal(u = Bo, + 8, E(Y),0 = 6;); 
Collect 700 low fidelity observations; 
Set Y’ = Boa + BaY; 
Generate fyd using epi-splines with Y’ observations; 
Collect 10 high-fidelity observations X ¢ [a,b]; 
Deconvolve fy, using high-fidelity observations and oy to generate fxg; 
fx = pixn+(1—D) xa’ 
return fy 


A.4 Notes on Computation Time 

We solved for all epi-spline estimates on a mid-2012 MacBook Air® laptop with a 1.8 Ghz 
Intel Core 15 processor and 4GB 1600 Mhz DDR3 RAM using GAMS and the CONOPT 
solver. All solves took less than 1-2 minutes. The numerical experiments of 30 replications 
in Section 3.3 took less than 8 minutes each. 
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